Reconfigurable fabric accessing external memory

ABSTRACT

Techniques are disclosed for data manipulation. A memory request is initiated from a first cluster within a reconfigurable fabric. The memory request is queued within a first FIFO for execution by an external memory direct memory access (DMA) master. The memory request is issued to an external memory. The external memory is associated with the external memory DMA master. The external memory is accessed based on the memory request. Data is transferred between the external memory and a second cluster within the reconfigurable fabric, facilitated by a second FIFO. The queuing is accomplished by a queue manager, the first FIFO, and the second FIFO. The first FIFO and the second FIFO are controlled by circular buffers. A read clock for the first FIFO is based on a clock for the first FIFO's circular buffer, whereas a write clock is based on a clock for the external memory.

RELATED APPLICATIONS

This application claims the benefit of U.S. provisional applications “Direct Memory Access within a Reconfigurable Fabric” Ser. No. 62/399,785, filed Sep. 26, 2016, “Reconfigurable Fabric Direct Memory Access” Ser. No. 62/399,948, filed Sep. 26, 2016, “Reconfigurable Fabric Accessing External Memory” Ser. No. 62/399,964, filed Sep. 26, 2016, and “Reconfigurable Fabric Direct Memory Access with Multiple Read or Write Elements” Ser. No. 62/440,248, filed Dec. 29, 2016.

Each of the foregoing applications is hereby incorporated by reference in its entirety.

FIELD OF ART

This application relates generally to data manipulation and more particularly to reconfigurable fabric accessing external memory.

BACKGROUND

Modern electronic systems impact the lives of a wide variety of people on a daily basis. These electronic systems support communication, provide information, etc. At the heart of many modern electronic systems are integrated electronic circuits, or “chips”, as they are often called. These chips are designed to perform a wide variety of functions in the electronic systems and enable those systems to perform their functions effectively and efficiently. The chips are based on highly complex circuit designs, system architectures, and implementations, and are integral to the electronic systems. The chips implement functions such as communications, processing, and networking, whether the electronic systems are applied to business, entertainment, or consumer electronics purposes. A typical electronic system includes more than one chip. The chips implement critical functions including computation, storage, and control. The chips support the electronic systems by computing algorithms and heuristics, handling and processing data, communicating internally and externally to the electronic system, and so on. Since there are so many computations that must be performed, any improvements in the efficiency of the computations have a significant and substantial impact on overall system performance. As the amount of data to be handled increases, the approaches that are used must not only be effective, efficient, and economical, but must also scale as the amount of data increases.

Single processor architectures are well-suited for some tasks, but are unable to provide the level of performance required by some high-performance systems. Parallel processing based on general-purpose processors can attain an increased level of performance. Thus, using systems with multiple processing elements is one approach for achieving increased performance. There is a wide variety of applications that demand a high level of performance. Such applications can include networking, image processing, simulations, and signal processing, to name a few. In addition to computing power, flexibility is also important for adapting to ever-changing business needs and technical situations.

Some applications demand reconfigurability. Reconfigurability is an important attribute in many processing applications, as reconfigurable devices are extremely efficient for certain types of processing tasks. In certain circumstances, reconfigurable devices have cost and performance advantages over other devices because reconfigurable logic enables program parallelism, allowing for multiple computation operations to occur simultaneously for the same program. Meanwhile, conventional processors are often limited by instruction bandwidth and execution restrictions. Typically, the high-density properties of reconfigurable devices come at the expense of the high-diversity property that is inherent in microprocessors. Microprocessors have evolved to a highly optimized configuration that can provide cost/performance advantages over reconfigurable arrays for certain tasks with high-functional diversity. However, there are many tasks for which a conventional microprocessor might not be the best design choice. An architecture supporting configurable interconnected processing elements can be a viable alternative in many data intensive applications, especially for moving and processing incredibly large amounts of data. Data-driven applications demand a whole new architecture of computing structures to meet the throughput and processing needs contained therein.

The emergence of reconfigurable computing has enabled a higher level of both flexibility and performance of computer systems. Reconfigurable computing combines the high speed of application-specific integrated circuits with the flexibility of programmable processors. This provides much-needed functionality and power to enable the technology used in many current and upcoming fields.

SUMMARY

Reconfigurable fabrics that can include arrays or clusters of processing elements, switching elements, controlling elements, interface elements, etc., have many applications where high speed transferring and processing of data is advantageous. Interfaces to the clusters can support multiple master/slave interfaces, where a master processing element can control data transfer, and a slave processing element can be a reader (sink of data), a writer (source of data), or part of the transfer datapath. The interfaces are coupled to first in first out (FIFO) blocks that provide the interfaces with custom logic and alignment between the FIFO channels and a static schedule of a row or a column of the clusters. The interface allows communication between an asynchronous fabric and a separately clocked external, synchronous memory subsystem. The slave interfaces can load programs into the clusters. Each interface can be connected to various configuration paths, where each path is buffered to support independent and concurrent operations.

Disclosed embodiments provide for improved data manipulation performance by a reconfigurable fabric accessing external memory. A reconfigurable fabric includes a plurality of clusters. The clusters can include other elements such as switching elements, processing elements, control elements, interface elements, storage elements, and so on. A memory request is initiated from a first cluster within a reconfigurable fabric. The memory request is queued within a first FIFO for execution by an external memory direct memory access (DMA) master. The memory request is issued to an external memory. The external memory is associated with the external memory DMA master. The external memory is accessed based on the memory request. Data is transferred between the external memory and a second cluster within the reconfigurable fabric, facilitated by a second FIFO. The queuing is accomplished by a queue manager, the first FIFO, and the second FIFO. The first FIFO and the second FIFO are controlled by circular buffers. A read clock for the first FIFO is based on a clock for the first FIFO's circular buffer, whereas a write clock is based on a clock for the external memory.

A processor-implemented method for data manipulation is disclosed comprising: initiating a memory request from a first cluster within a reconfigurable fabric; queuing the memory request within a first FIFO for execution by an external memory direct memory access (DMA) master; issuing the memory request to an external memory, wherein the external memory is associated with the external memory DMA master; accessing the external memory based on the memory request; and transferring data between the external memory and a second cluster within the reconfigurable fabric. In embodiments, the first cluster and the second cluster are distinct clusters within the reconfigurable fabric. In other embodiments, the first cluster and the second cluster are the same cluster within the reconfigurable fabric. In embodiments, the external memory is outside the reconfigurable fabric. In embodiments, the transferring data is facilitated by a second FIFO. The queuing can be accomplished by a queue manager, the first FIFO, and the second FIFO. The first FIFO can be controlled by a first circular buffer. The first circular buffer can provide instructions for unloading data from the first FIFO to the reconfigurable fabric. A read clock for the first FIFO can be based on a clock for the first circular buffer, and a write clock for the first FIFO can be based on a clock for the external memory. The second FIFO can be controlled by a second circular buffer.

In embodiments, the transferring data comprises a read operation from the external memory, and the transferring data comprises a write operation to the external memory. In embodiments, a computer program product is embodied in a non-transitory computer readable medium for data manipulation comprising code which causes one or more processors to perform operations of: initiating a memory request from a first cluster within a reconfigurable fabric; queuing the memory request within a first FIFO for execution by an external memory direct memory access (DMA) master; issuing the memory request to an external memory, wherein the external memory is associated with the external memory DMA master; accessing the external memory based on the memory request; and transferring data between the external memory and a second cluster within the reconfigurable fabric.

Various features, aspects, and advantages of various embodiments will become more apparent from the following further description.

BRIEF DESCRIPTION OF THE DRAWINGS

The following detailed description of certain embodiments may be understood by reference to the following figures wherein:

FIG. 1 is a flow diagram for reconfigurable fabric accessing external memory.

FIG. 2 is a flow diagram for using a second cluster.

FIG. 3 illustrates an example interface architecture.

FIG. 4 shows fabric/external memory DMA transfer using an AXI interface.

FIG. 5 shows writing via a slave.

FIG. 6 illustrates concurrent read/write via a slave.

FIG. 7 shows master concurrent read/write transfers.

FIG. 8A illustrates credit generation for a reconfigurable fabric accessing external memory.

FIG. 8B is an example configuration for credit-based flow control.

FIG. 9 illustrates interconnecting reconfigurable fabrics via external memory.

FIG. 10 is a system diagram for data manipulation within a reconfigurable fabric.

DETAILED DESCRIPTION

Techniques are disclosed for data manipulation. Commercial, military, and other market segments engage the electronics and semiconductor industries to improve the semiconductor chips and systems that they design, develop, implement, fabricate, and deploy. There are many factors used to measure the improvements of the semiconductor chips that include design criteria such as the price, dimensions, speed, power consumption, heat dissipation, feature sets, compatibility, etc. These chip measurements find their ways into designs of the semiconductor chips and the capabilities of the electronic systems that are built from the chips. Many market segments including commercial, medical, consumer, educational, financial, etc., benefit from the semiconductor chips and systems that are deployed in those segments. The abilities of the chips to perform basic logical operations and to process data at high speed are fundamental to any of the chip and system applications. The abilities of the chips to transfer very large data sets have become particularly critical because of the demands of many applications.

Chip, system, and computer architectures have traditionally relied on controlling the flow of data through the chip, system, or computer. The data can be stored in memory that is located on the chip or system, or it can be stored externally to the chip or system. In these architectures, such as the classic Von Neumann architecture where memory is shared for storing instructions and data, a set of instructions is executed to process data. With such an architecture, referred to as a “control flow”, the execution of the instructions can be predicted and can be deterministic. That is, the way in which data is processed is dependent upon the point in a set of instructions at which a chip, system, or computer is operating. This processing of data occurs whether the memory is internal or external and irrespective of how the data is obtained. In contrast, a “dataflow” architecture is one in which the data controls the order of operation of the chip, system, or computer. The dataflow control can be determined by the presence or absence of data. Dataflow architectures find applications in many areas including the fields of networking and digital signal processing, as well as other areas in which large data sets must be handled, such as telemetry, graphics processing, machine learning, “big data”, etc.
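As a minimal illustrative sketch of the dataflow principle described above, in which the presence or absence of data determines the order of operation, the following Python fragment fires a node only when all of its input tokens are present. The names `Node` and `run_dataflow` and the token names are hypothetical and are introduced solely for illustration; they do not describe any particular embodiment.

```python
class Node:
    """A hypothetical dataflow node that fires once all inputs are present."""

    def __init__(self, name, inputs, func):
        self.name = name      # name of the token this node produces
        self.inputs = inputs  # names of the tokens this node consumes
        self.func = func      # operation applied to the input values


def run_dataflow(nodes, tokens):
    """Fire any node whose inputs are all present until nothing can fire."""
    tokens = dict(tokens)
    pending = list(nodes)
    fired = True
    while pending and fired:
        fired = False
        for node in list(pending):
            if all(name in tokens for name in node.inputs):
                values = [tokens[name] for name in node.inputs]
                tokens[node.name] = node.func(*values)
                pending.remove(node)
                fired = True
    return tokens


# "total" is listed first but cannot fire until "scaled" has produced data.
nodes = [
    Node("total", ["scaled", "c"], lambda s, c: s + c),
    Node("scaled", ["a", "b"], lambda a, b: a * b),
]
result = run_dataflow(nodes, {"a": 2, "b": 3, "c": 4})
```

In this sketch the listing order of the nodes does not determine the execution order; availability of data does, which is the distinction drawn between control flow and dataflow.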

Direct memory access (DMA) can be applied to improve communication between processing elements, switching elements, etc. of a fabric or cluster of such elements. Since communication such as the transfer of data from one location to another location can be a limiting factor in system performance, increased communication rate and efficiency with which data is obtained from sources such as external memory can directly impact speed. A plurality of switching elements can be included in a reconfigurable fabric. The switching elements can comprise clusters within the reconfigurable fabric. A memory request can be initiated from a first cluster within a reconfigurable fabric. The memory request can be queued within a first FIFO for execution by an external memory direct memory access (DMA) master. The memory request can be issued to an external memory. The external memory can be associated with the external memory DMA master. The external memory can be accessed based on the memory request. Data can be transferred between the external memory and a second cluster within the reconfigurable fabric, facilitated by a second FIFO. The queuing can be accomplished by a queue manager, the first FIFO, and the second FIFO. The first FIFO and the second FIFO can be controlled by circular buffers. A read clock for the first FIFO can be based on a clock for the first FIFO's circular buffer. A write clock for the first FIFO can be based on a clock for the external memory. The external memory can be a hybrid memory cube (HMC). The reconfigurable fabric can include other elements such as processing elements, control elements, interface elements, storage elements, and so on.

Dataflow processors can be applied to many applications where large amounts of data such as unstructured data are processed. Typical processing applications for unstructured data can include speech and image recognition, natural language processing, bioinformatics, customer relationship management, digital signal processing (DSP), graphics processing (GP), network routing, telemetry such as weather data, data warehousing, and so on. Dataflow processors can be programmed using software and can be applied to highly advanced problems in computer science such as deep learning. Deep learning techniques can include an artificial neural network, a convolutional neural network, etc. The success of these techniques is highly dependent on large quantities of data for training and learning. The data-driven nature of these techniques is well suited to implementations based on dataflow processors. The dataflow processor can receive a dataflow graph such as an acyclic dataflow graph, where the dataflow graph can represent a deep learning network. The dataflow graph can be assembled at runtime, where assembly can include calculation input/output, memory input/output, and so on. The assembled dataflow graph can be executed on the dataflow processor.

The dataflow processors can be organized in a variety of configurations. One configuration can include processing element quads with arithmetic units. A dataflow processor can include one or more processing elements (PE). The processing elements can include a processor, a data memory, an instruction memory, communications capabilities, and so on. Multiple PEs can be grouped, where the groups can include pairs, quads, octets, etc. The PEs arranged in arrangements such as quads can be coupled to arithmetic units, where the arithmetic units can be coupled to or included in data processing units (DPU). The DPUs can be shared between and among quads. The DPUs can provide arithmetic techniques to the PEs, communications between quads, and so on.

The dataflow processors, including dataflow processors arranged in quads, can be loaded with kernels. The kernels can be a portion of a dataflow graph. In order for the dataflow processors to operate correctly, the quads can require reset and configuration modes. Processing elements can be configured into clusters of PEs. Kernels can be loaded onto PEs in the cluster, where the loading of kernels can be based on the availability of free PEs, an amount of time to load the kernel, an amount of time to execute the kernel, and so on. Reset can begin with initializing up-counters coupled to PEs in a cluster of PEs. Each up-counter is initialized with a value −1 plus the Manhattan distance from a given PE in a cluster to the end of the cluster. A Manhattan distance can include a number of steps to the east, west, north, and south. A control signal can be propagated from the start cluster to the end cluster. The control signal advances 1 cluster per cycle. When the counters for the PEs all reach 0, then the processors have been reset. The processors can be suspended for configuration, where configuration can include loading of one or more kernels onto the cluster. The processors can be enabled to execute the one or more kernels. Configuring mode for a cluster can include propagating a signal. Clusters can be preprogrammed to enter configuration mode. A configuration mode can be entered. Various techniques, including direct memory access (DMA) can be used to load instructions from the kernel into instruction memories of the PEs. The clusters that were pre-programmed into configuration mode can be preprogrammed to exit configuration mode. When configuration mode has been exited, execution of the one or more kernels loaded onto the clusters can commence. In embodiments, clusters are reprogrammed, and, during the reprogramming, switch instructions used for routing are not interfered with so that routing continues through a cluster.

Dataflow processes that can be executed by a dataflow processor can be managed by a software stack. A software stack can include a set of subsystems, including software subsystems, which might be needed to create a software platform. A complete software platform can include a set of software subsystems required to support one or more applications. A software stack can include offline operations and online operations. Offline operations can include software subsystems such as compilers, linkers, simulators, emulators, and so on. The offline software subsystems can be included in a software development kit (SDK). The online operations can include dataflow partitioning, dataflow graph throughput optimization, and so on. The online operations can be executed on a session host and can control a session manager. Online operations can include resource management, monitors, drivers, etc. The online operations can be executed on an execution engine. The online operations can include a variety of tools which can be stored in an agent library. The tools can include BLAS™, CONV2D™, SoftMax™, and so on.

Software to be executed on a dataflow processor can include precompiled software or agent generation. The precompiled agents can be stored in an agent library. An agent library can include one or more computational models which can simulate actions and interactions of autonomous agents. Autonomous agents can include entities such as groups, organizations, and so on. The actions and interactions of the autonomous agents can be simulated to determine how the agents can influence operation of a whole system. Agent source code can be provided from a variety of sources. The agent source code can be provided by a first entity, provided by a second entity, and so on. The source code can be updated by a user, downloaded from the Internet, etc. The agent source code can be processed by a software development kit, where the software development kit can include compilers, linkers, assemblers, simulators, debuggers, and so on. The agent source code that can be operated on by the software development kit can be in an agent library. The agent source code can be created using a variety of tools, where the tools can include MATMUL™, Batchnorm™, Relu™, and so on. The agent source code that has been operated on can include functions, algorithms, heuristics, etc., that can be used to implement a deep learning system.

A software development kit can be used to generate code for the dataflow processor or processors. The software development kit can include a variety of tools which can be used to support a deep learning technique or another technique which requires processing of large amounts of data such as unstructured data. The SDK can support multiple machine learning techniques such as machine learning techniques based on GEMM™, sigmoid, and so on. The SDK can include a low-level virtual machine (LLVM) which can serve as a front end to the SDK. The SDK can include a simulator. The SDK can include a Boolean satisfiability solver (SAT solver). The SDK can include an architectural simulator, where the architectural simulator can simulate a dataflow processor or processors. The SDK can include an assembler, where the assembler can be used to generate object modules. The object modules can represent agents. The agents can be stored in a library of agents. Other tools can be included in the SDK. The various techniques of the SDK can operate on various representations of a flow graph.

An agent library can include one or more computational models which can simulate actions and interactions of autonomous agents. Autonomous agents can include entities such as groups, organizations, and so on. The actions and interactions of the autonomous agents can be simulated to determine how the agents can influence operation of a whole system. Agent source code can be provided from a variety of sources. The agent source code can be provided by a first entity, provided by a second entity, and so on. The source code can be updated by a user, downloaded from the Internet, etc. The agent source code can be processed by a software development kit, where the software development kit can include compilers, linkers, assemblers, simulators, debuggers, and so on. The agent source code that can be operated on by the software development kit (SDK) can be in an agent library. The agent source code that has been operated on can include functions, algorithms, heuristics, etc., that can be used to implement a deep learning system.

Direct memory access (DMA) can be applied to improve communication between a cluster within a reconfigurable fabric and external memory. Since communication such as the transfer of data from one location to another location can be a limiting factor in system performance, increased communication rate and efficiency can directly impact speed and performance. In embodiments, a plurality of switching elements forms two or more clusters within a reconfigurable fabric. In embodiments, the two or more clusters are initialized to perform an operation based on one or more agents defined in a software development kit (SDK).

FIG. 1 is a flow diagram for reconfigurable fabric accessing external memory. The reconfigurable fabric can include datapaths, first in first out (FIFO) memories, circular buffers, etc., and other elements, such as processing elements, control elements, interface elements, and so on. A memory request is initiated from a first cluster within a reconfigurable fabric. The memory request is queued within a first FIFO for execution by an external memory direct memory access (DMA) master. The memory request is issued to an external memory. The external memory is associated with the external memory DMA master. The external memory is accessed based on the memory request. Data is transferred between the external memory and a second cluster within the reconfigurable fabric, facilitated by a second FIFO. The flow 100 comprises a processor-implemented method for data manipulation. The flow 100 includes initiating a memory request from a cluster 110. The cluster can be part of a reconfigurable fabric. The cluster can include switching elements, processing elements, storage elements, input/output elements, and so on. Note that a cluster is sometimes referred to as the element that is currently being focused on within the cluster. For example, if a cluster is being used as part of a transfer path for DMA data, then the focus is on data transfer and the cluster can be referred to as a switching element.

The flow 100 includes queuing the request in a first FIFO 120. FIFO is an industry standard term for a first-in-first-out memory structure. For example, if data A is written into a FIFO, and then many additional pieces of data are written into the same FIFO, data A would generally be the first piece of data to exit the FIFO, and then the additional pieces would exit in the order in which they were written. The FIFO enables formation of a queue structure within the reconfigurable fabric. A queue structure is a holding structure for pending events such as data reads, data writes, control operations, address transmission, and so on. The flow 100 includes issuing a memory request to external memory 130. The memory request can be for a memory read, a memory write, memory status, or other functions involving an external memory. The external memory is not generally a part of the reconfigurable fabric initiating the memory request; therefore, in embodiments, the external memory is outside the reconfigurable fabric. The external memory can be realized using DRAM, SRAM, flash memory, hybrid memory cube (HMC), or any other such memory technology.
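The first-in-first-out ordering described above can be sketched, by way of a non-limiting example, as a bounded queue in Python. The class name `Fifo`, its `depth` parameter, and the request labels are assumptions made for illustration only.

```python
from collections import deque


class Fifo:
    """Hypothetical bounded first-in-first-out queue for pending requests."""

    def __init__(self, depth):
        self.depth = depth     # maximum number of entries held at once
        self._items = deque()

    def is_full(self):
        return len(self._items) >= self.depth

    def is_empty(self):
        return not self._items

    def push(self, item):
        """Write an entry at the tail of the queue."""
        if self.is_full():
            raise OverflowError("FIFO full")
        self._items.append(item)

    def pop(self):
        """Read the oldest entry from the head of the queue."""
        if self.is_empty():
            raise IndexError("FIFO empty")
        return self._items.popleft()


fifo = Fifo(depth=4)
for request in ["A", "B", "C"]:
    fifo.push(request)

# Entries exit in the order in which they were written.
drained = [fifo.pop() for _ in range(3)]
```

Here data A, written first, is also the first to exit, followed by the remaining entries in write order.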

The flow 100 includes accessing external memory 140. External memory can be accessed most typically for a read or write operation; that is, data can be read from the external memory or written into the external memory. Other operations can be possible; for example, a block erase can be performed on flash memory. The flow 100 includes transferring data between external memory and a cluster 150. The data transfer can most typically be involved in a memory read or a memory write operation, although other data transfer operations can be possible. The data transfer can be between the external memory and the cluster that initiated it. In embodiments, the data is transferred such that a second cluster receives the data from external memory 152. Alternatively, the data can be transferred such that a second cluster writes the data to external memory (not shown). In some embodiments, the first cluster and the second cluster are distinct clusters within the reconfigurable fabric, while in other embodiments, the first cluster and the second cluster are the same cluster within the reconfigurable fabric. The transferring of data includes facilitating by a second FIFO 154. The second FIFO can be included in a different path among the clusters of the reconfigurable fabric. The transferring data can be controlled using credit-based flow control 156.
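The read path of the flow 100 can be sketched end to end as follows. This is a simplified software model under stated assumptions, not a description of the hardware: the external memory is modeled as a Python dictionary, the 4-byte word stride, the addresses, and the function names `initiate_read`, `dma_master_step`, and `second_cluster_receive` are all hypothetical.

```python
from collections import deque

# Assumed toy memory contents: byte address -> stored word.
external_memory = {0x10: 7, 0x14: 8, 0x18: 9}

request_fifo = deque()  # first FIFO: requests awaiting the DMA master
data_fifo = deque()     # second FIFO: data headed to the second cluster


def initiate_read(address, length):
    """First cluster: initiate a request and queue it (steps 110 and 120)."""
    request_fifo.append({"op": "read", "address": address, "length": length})


def dma_master_step():
    """DMA master: issue one request and access memory (steps 130 and 140)."""
    request = request_fifo.popleft()
    for i in range(request["length"]):
        word = external_memory[request["address"] + 4 * i]
        data_fifo.append(word)


def second_cluster_receive():
    """Second cluster: receive data through the second FIFO (150, 152, 154)."""
    received = []
    while data_fifo:
        received.append(data_fifo.popleft())
    return received


initiate_read(0x10, 3)
dma_master_step()
received = second_cluster_receive()
```

In this sketch the initiating cluster never touches the data itself; it only places a request in the first FIFO, and the data reaches the second cluster through the second FIFO.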

The credit-based flow control provides a mechanism to avoid collisions and prevent back pressure on a FIFO 122 by ensuring that the FIFO can accept and retransmit data, addresses, and/or controls without conflicting transfers from other elements within the fabric or external to the fabric. Back pressure is the phenomenon of a FIFO having too much data trying to enter the FIFO before a commensurate amount of data leaves the FIFO. The queue control structure can slow or stop input data into a FIFO to prevent back pressure. For example, a FIFO involved with writing a large block of data out to external memory might have to wait to complete a transfer because the external memory is busy completing a different operation with a different cluster in the same or a different reconfigurable fabric. The input to the FIFO cannot accept new data, so the queue manager structure alerts the clusters involved to pause sending data until the back pressure in the queue is relieved by having the FIFO's current contents transferred out or started to be transferred out. In embodiments, the transferring data is facilitated by a second FIFO and the queuing is accomplished by a queue manager, the first FIFO, and the second FIFO.
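One possible way to model credit-based flow control is with a credit counter that starts at the FIFO depth, is decremented on each send, and is incremented when an entry drains. The following Python sketch assumes exactly that scheme; the class `CreditedFifo` and its method names are hypothetical, and actual embodiments can generate and return credits differently.

```python
from collections import deque


class CreditedFifo:
    """Hypothetical FIFO guarded by a credit counter, one credit per free slot."""

    def __init__(self, depth):
        self.credits = depth
        self._slots = deque()

    def try_send(self, word):
        """Sender side: spend a credit, or report that the sender must pause."""
        if self.credits == 0:
            return False          # no credit: sending now would cause back pressure
        self.credits -= 1
        self._slots.append(word)
        return True

    def drain_one(self):
        """Receiver side: remove the oldest word and return its credit."""
        word = self._slots.popleft()
        self.credits += 1
        return word


fifo = CreditedFifo(depth=2)

# The third send is refused because both credits are already spent.
accepted = [fifo.try_send(w) for w in ("w0", "w1", "w2")]

# Draining one entry returns a credit, so the paused sender can retry.
first_out = fifo.drain_one()
retry = fifo.try_send("w2")
```

Because a sender transmits only while it holds a credit, the FIFO is never offered more data than it can accept, which is the condition the queue manager enforces.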

The flow 100 includes controlling a FIFO using a circular buffer 124. The circular buffer can be based on various types of memories such as SRAM, DRAM, ROM, etc. The circular buffer can provide instructions for unloading data from a FIFO within a reconfigurable fabric. The unloading of data from the FIFO can be based on DMA. In embodiments, the first FIFO is controlled by a first circular buffer. In embodiments, the first circular buffer provides instructions for unloading data from the first FIFO to the reconfigurable fabric. In embodiments, a read clock for the first FIFO is based on a clock for the first circular buffer. In this way, the read side of the FIFO is synchronized with the circular buffer within the reconfigurable fabric. In embodiments, a write clock for the first FIFO is based on a clock for the external memory. In this way, the write side of the FIFO is synchronized with the external memory and not the reconfigurable fabric. In embodiments, the second FIFO is controlled by a second circular buffer. Many FIFOs can exist within the reconfigurable fabric, each controlled by its own circular buffer. In embodiments, the transferring data comprises a read operation from the external memory and a write operation to the external memory. In embodiments, the transferring data comprises a credit-based flow control, and the credit-based flow control is based on a credit counter. In embodiments, the queuing prevents back pressure on the first FIFO.
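The relationship between the two clocks of the first FIFO can be illustrated with a small discrete-time simulation. The sketch below assumes a two-slot circular buffer holding the hypothetical instructions `unload` and `nop`, and assumes arbitrary clock periods of 3 time units for the external memory and 2 time units for the circular buffer; none of these values comes from the disclosure.

```python
from collections import deque
from itertools import cycle

# Assumed static schedule: the circular buffer rotates through these slots.
schedule = cycle(["unload", "nop"])

fifo = deque()                         # first FIFO between memory and fabric
incoming = deque(["d0", "d1", "d2"])   # words arriving from external memory
delivered = []                         # words unloaded into the fabric

MEMORY_PERIOD = 3  # assumed: write side ticks with the external memory clock
FABRIC_PERIOD = 2  # assumed: read side ticks with the circular buffer clock

for t in range(1, 25):
    # Write side: advances only on external memory clock ticks.
    if t % MEMORY_PERIOD == 0 and incoming:
        fifo.append(incoming.popleft())
    # Read side: advances only when the circular buffer steps.
    if t % FABRIC_PERIOD == 0:
        instruction = next(schedule)
        if instruction == "unload" and fifo:
            delivered.append(fifo.popleft())
```

The FIFO absorbs the mismatch between the two clock rates: words are written whenever the memory side ticks, and they leave only on the slots in which the circular buffer schedules an unload, in their original order.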

In embodiments, the first cluster within the reconfigurable fabric and the second cluster within the reconfigurable fabric are each controlled by one or more circular buffers, and the one or more circular buffers are statically scheduled. In embodiments, the first cluster within the reconfigurable fabric initiates a memory request based on contents of one of the one or more circular buffers. In some embodiments, the second cluster within the reconfigurable fabric is a receiving cluster for the transferring data, and in other embodiments, the second cluster within the reconfigurable fabric is a sending cluster for the transferring data. In embodiments, the second cluster sends data to external memory outside the reconfigurable fabric. Various steps in the flow 100 may be changed in order, repeated, omitted, or the like without departing from the disclosed concepts. Various embodiments of the flow 100 can be included in a computer program product embodied in a non-transitory computer readable medium that includes code executable by one or more processors.

FIG. 2 is a flow diagram for using a second cluster. The second cluster can be part of a reconfigurable fabric. The reconfigurable fabric can include datapaths, first in first out (FIFO) memories, circular buffers, etc., and other elements, such as processing elements, control elements, interface elements, and so on. A memory request is initiated from a first cluster within a reconfigurable fabric. The memory request is queued within a first FIFO for execution by an external memory direct memory access (DMA) master. The memory request is issued to an external memory. The external memory is associated with the external memory DMA master. The external memory is accessed based on the memory request. Data is transferred between the external memory and a second cluster within the reconfigurable fabric, facilitated by a second FIFO. The flow 200 comprises a processor-implemented method for data manipulation. The flow 200 includes allocating a second cluster within the reconfigurable fabric 210. The first cluster within the reconfigurable fabric can be the initiating cluster. The initiating cluster can designate a second cluster as the receiving cluster 220 for the data transfer direct memory access (DMA) operation being initiated.

The flow 200 includes the second cluster receiving data from external memory 230. The data received from external memory does not itself flow back to the first, initiating cluster, although a token of status or completeness might. The flow 200 includes designating a second cluster as a sending cluster 240. Data from the second cluster can be sent to external memory 250. Thus, a second cluster in flow 200 can be designated as either a receiving cluster or a sending cluster. In embodiments, a second cluster is designated as a receiving cluster and a third cluster is designated as a sending cluster. In some embodiments, the sending cluster and the receiving cluster are the same cluster. In embodiments, the first cluster within a reconfigurable fabric and the second cluster within the reconfigurable fabric are synchronized within the reconfigurable fabric.

In embodiments, the external memory comprises a hybrid memory cube (HMC). In embodiments, a second reconfigurable fabric is connected to the reconfigurable fabric through the HMC. In embodiments, each cluster within a reconfigurable fabric is self-timed and each cluster being self-timed is self-clocked. In embodiments, each cluster within a reconfigurable fabric is synchronized to be within a tic cycle of each other cluster within a reconfigurable fabric. In embodiments, the data includes valid data. In embodiments, the valid data wakes a processing element when the valid data arrives at the processing element. In embodiments, the data includes non-empty data. Various steps in the flow 200 may be changed in order, repeated, omitted, or the like without departing from the disclosed concepts. Various embodiments of the flow 200 can be included in a computer program product embodied in a non-transitory computer readable medium that includes code executable by one or more processors.

FIG. 3 illustrates an example interface architecture 300. A plurality of switching elements includes a reconfigurable fabric. A first FIFO memory is coupled to the reconfigurable fabric by a first datapath from the first FIFO memory to a first switching element, from the plurality of switching elements. An interface receives data from an external memory and provides the data to the first FIFO memory by a second datapath. The first FIFO memory, the first circular buffer, the first datapath, and the second datapath are used in direct memory access (DMA) of the external memory. The plurality of switching elements can be contained within a plurality of clusters within the reconfigurable fabric. The reconfigurable fabric can include other elements such as processing elements, control elements, interface elements, storage elements, and so on.

The reconfigurable fabric can support multiple master interfaces and multiple slave interfaces. In embodiments, a reconfigurable fabric includes processing elements, clusters, and interfaces to the advanced extensible interface (AXI). The AXI interfaces can be distributed around the periphery of an array of processing elements for interconnection purposes. Each interface has an AXI master load/store unit, here referred to as AXIM 312 (or AMI, for AXI Master Interface) and one slave unit, here referred to as AXIS 310. The AXIM 312 and AXIS 310 interfaces can be connected to first-in-first-out (FIFO) blocks 330, 332, 334, and 336. While four FIFO blocks are shown, more or fewer FIFO blocks can be included. The FIFO blocks can provide an interface to logic within a reconfigurable fabric and can enable an alignment between the FIFO channels and a static schedule of a row of clusters or a static schedule of a column of clusters in a reconfigurable fabric. In embodiments, each cluster within the plurality of clusters is self-timed. A self-timed cluster can operate independently of a system clock. Each cluster being self-timed can be self-clocked. A self-timed cluster can generate its own clock signals. The self-timed cluster can be based on various circuit types that can include logic families such as null convention logic (NCL). Each cluster within the plurality of clusters can be synchronized to be within a tic cycle of each other cluster within the plurality of clusters. A tic cycle can be a clock cycle, a clock tick, a self-clocked cycle, a self-timed cycle, a system cycle, and so on. The tic cycle can be applied to master interfaces, slave interfaces, and so on. The tic cycle can include operations such as read operations and write operations. A read for the first FIFO memory can be based on a clock for the first circular buffer.

Slave interfaces can load programs into the reconfigurable fabric. Each slave interface can be operated independently and concurrently. Special channels can be provided to enable the loading of programs. Each special interface, such as interface AXIS 310 and interface AXIM 312, can be connected to one or more configuration paths. Each configuration path can be buffered in order to enable independent operation and concurrent operation. Interface architectures such as architecture 300 can operate independently from and concurrently with similar interface architectures.

A program can be based on instructions that can be stored in a memory. In embodiments, the instructions are stored in a circular buffer. The circular buffer can be incremented (moved forward) to reach a later instruction or decremented (moved backward) to reach an earlier instruction. The circular buffer can be statically scheduled. Since the buffer is circular, instructions can be executed in the same order indefinitely by starting execution of the instructions at the same point in the circular buffer. If the instructions are to be executed from a different starting point, the first instruction to be executed can be set to be the starting point in the circular buffer. If a different set of instructions is to be executed, the previous contents of the circular buffer can be overwritten, replaced etc. by a new set of instructions. The circular buffer can be dynamically scheduled. The instructions in the circular buffer can perform a variety of operations. In embodiments, the first circular buffer provides instructions for unloading data from the first FIFO memory to the reconfigurable fabric.
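
The rotation behavior of an instruction circular buffer can be illustrated with a minimal Python sketch. The class `CircularBuffer`, its method names, and the instruction strings are assumptions made for illustration only. The sketch shows repeated execution in the same order, restarting from a different starting point, and overwriting the contents with a new set of instructions.

```python
class CircularBuffer:
    """Hypothetical instruction store that wraps around indefinitely."""

    def __init__(self, instructions, start=0):
        self.instructions = list(instructions)
        self.index = start % len(self.instructions)

    def step(self):
        """Return the current instruction and move forward by one slot."""
        instr = self.instructions[self.index]
        self.index = (self.index + 1) % len(self.instructions)
        return instr

    def set_start(self, start):
        """Choose a different first instruction to execute."""
        self.index = start % len(self.instructions)

    def load(self, instructions):
        """Overwrite the previous contents with a new instruction set."""
        self.instructions = list(instructions)
        self.index = 0


cb = CircularBuffer(["load", "add", "store"])
two_passes = [cb.step() for _ in range(6)]   # same order, twice
cb.set_start(1)
from_second = [cb.step() for _ in range(3)]  # begins at "add"
cb.load(["mul", "store"])
replaced = [cb.step() for _ in range(3)]
```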

An example interface 300 can include a master load/store interface AXIM 312 and a slave interface AXIS 310. The interfaces 310 and 312 can serve as the interfaces between the reconfigurable fabric and an external memory. The external interface 300 can include switching elements such as a 4-2 switch 320 for uploading data, and a 2-4 switch 322 for receiving data. In embodiments, the second switching element within the plurality of switching elements initiates access to the external memory. The switching elements can be separate elements in the reconfigurable fabric. In embodiments, the second switching element and the first switching element are combined. The sizes of the switches 320 and 322 can be dependent on the numbers of AXIS and AXIM interfaces, the numbers of FIFOs, and so on. The example interface 300 can include a FIFO, such as FIFO 330. In embodiments, the interface includes a second FIFO memory coupled to the reconfigurable fabric, such as FIFO 332. Other FIFOs, such as FIFOs 334 and 336 can also be coupled to the reconfigurable fabric. The FIFOs, such as FIFOs 330, 332, 334, and 336, can be coupled to the switches 320 and 322. The switch 320 can be a 4 to 2 switch; that is, it can take four input channels, or ports, and route them to one or both of two output channels, depending on the needs of the fabric. The switch 322 can be a 2 to 4 switch; that is, it can take two input channels, or ports, and route them to one or more of four possible output ports, depending on the needs of the fabric. The FIFOs can be coupled to the reconfigurable fabric (not shown). The FIFOs can be used as buffers, where the buffers can be used to handle differing clock rates, data transfer rates, etc., between the datapaths of the reconfigurable fabric and the external memory. The one or more buffers can be separate from the FIFOs. In embodiments, a buffer is included in the second datapath. 
FIFOs 330, 332, 334, and 336 can each have a port from which to transmit data (Tx) and a port with which to receive data (Rx).
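
The routing performed by the 4-2 switch 320 and the 2-4 switch 322 can be modeled as a simple selection table. The following Python sketch is a hedged illustration: the function names and the `select` tuple encoding (one source index, or `None`, per output port) are hypothetical and are not taken from the disclosure.

```python
def route(inputs, select):
    """Drive each output port from the input port named in `select`;
    a None entry leaves that output port idle."""
    return [None if src is None else inputs[src] for src in select]


def switch_4_to_2(inputs, select):
    """Four input ports routed to one or both of two output ports."""
    if len(inputs) != 4 or len(select) != 2:
        raise ValueError("expected 4 inputs and 2 output selections")
    return route(inputs, select)


def switch_2_to_4(inputs, select):
    """Two input ports routed to one or more of four output ports."""
    if len(inputs) != 2 or len(select) != 4:
        raise ValueError("expected 2 inputs and 4 output selections")
    return route(inputs, select)


uplink = switch_4_to_2(["a", "b", "c", "d"], (2, 0))
downlink = switch_2_to_4(["x", "y"], (0, None, 1, 0))
```

Here `uplink` carries the third and first FIFO inputs, and `downlink` fans input `x` out to two of the four FIFO-side ports.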

The interface can receive data from an external memory, where an external memory can be a memory that is outside the reconfigurable fabric. The external memory can be a RAM, a ROM, a DRAM, an SRAM, an optical memory, and so on. Any of the FIFOs, such as FIFOs 330, 332, 334, and 336, can provide a request to the external memory. In embodiments, the second FIFO memory provides a request to the external memory. In order to access the external memory, an address can be required. In embodiments, the second FIFO memory provides information on a location requested from the external memory. A request to the external memory can also indicate an amount of data to be read or written. The data from the external memory can include a block of data. A block of data can be based on a quantity of bits, bytes, words, and so on. The information provided on the location can include addresses. The block of data can include a beginning address for the external memory. The beginning address can be based on a physical block location, a logical block location, and so on. The block of data can include an ending address for the external memory. The data can span multiple blocks. The amount of data might not completely fill a block, the last block of multiple blocks, and so on. The block of data can include an end-of-block location. Example AXIS architected signals ARS, RS, AWS, WS, and BS are shown as inputs and outputs to AXIS 310. Example AXIM architected signals AWM, WM, BM, ARM, and RM are shown as inputs and outputs to AXIM 312. Other types of external interfaces will have other architected signals. Example FIFO connections through FIFOs 330, 332, 334, and 336 are to switch elements within the clusters of the reconfigurable fabric and are shown as paths L2, which in some embodiments indicate interconnection with a level 2 switch within the fabric to enable sending and receiving data between the fabric and the AXI interfaces.
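
The address arithmetic implied by a request that carries a beginning address and an amount of data can be sketched as follows. The `MemoryRequest` class, its field names, and the 64-byte block size are assumptions chosen for illustration; the disclosure does not fix a block size or a request layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRequest:
    """Hypothetical request: beginning address plus amount of data."""
    begin: int            # beginning byte address in external memory
    length: int           # number of bytes to read or write
    block_size: int = 64  # assumed block size in bytes

    @property
    def end(self):
        """Inclusive ending byte address."""
        return self.begin + self.length - 1

    def blocks(self):
        """Indices of every block the request touches."""
        first = self.begin // self.block_size
        last = self.end // self.block_size
        return list(range(first, last + 1))

    def bytes_in_last_block(self):
        """The final block may be only partially filled."""
        return min(self.length, self.end % self.block_size + 1)


req = MemoryRequest(begin=256, length=150)
```

With these assumed values the request spans three blocks and fills only part of the last one, matching the case in which the amount of data does not completely fill the last block.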

The data received from the external memory via an interface such as AXIS 310 or AXIM 312 can travel through a switch such as 320 or 322, along a datapath to a FIFO such as 330, 332, 334, or 336. From the FIFOs 330, 332, 334, and 336, the data can travel to the reconfigurable fabric. The sending of the data from the external memory and the writing of the data to a FIFO can be coordinated to ensure proper reception of the data. In embodiments, a write clock for the first FIFO memory is based on a clock for the external memory. A write clock for other FIFO memories can be similarly based on a clock for the external memory. The FIFO can receive data from the external memory where the data can include valid data. Data can be confirmed to be valid data using various techniques such as a status signal, meeting timing requirements, and so on. The data can include non-empty data.

At times, the transfer of data from the reconfigurable fabric to an external memory can encounter a state where the data to be transferred is invalid, has not yet arrived, is empty, and so on. The transfer of data can be suspended and a processing element can be placed into a sleep state. The processing element can be woken from a sleep state as a result of an event. An event can include the arrival of valid data, non-empty data, etc. In embodiments, the valid data wakes a processing element when the valid data arrives at the processing element.

In embodiments, one or more processing elements of one or more clusters of processing elements are placed into a sleep state. A processing element can enter a sleep state based on processing an instruction that places the processing element into the sleep state. The processing element can be woken from the sleep state as a result of valid data being presented to the processing element of a cluster. Recall that a given processing element can be controlled by a circular buffer. The circular buffer can contain an instruction to place one or more of the processing elements into a sleep state. The circular buffer can remain awake while the processing element controlled by the circular buffer is in a sleep state. In embodiments, the circular buffer associated with the processing element is placed into the sleep state along with the processing element. The circular buffer can wake along with its associated processing element. The circular buffer can wake at the same address as when the circular buffer was placed into the sleep state, at an address that can continue to increment while the circular buffer was in the sleep state, etc. The circular buffer associated with the processing element can continue to cycle while the processing element is in the sleep state, but instructions from the circular buffer might not be executed. The sleep state can include a rapid transition to sleep state capability, where the sleep state capability can be accomplished by limiting clocking to portions of the processing elements. In embodiments, the sleep state includes a slow transition to sleep state capability, where the slow transition to sleep state capability is accomplished by powering down portions of the processing elements. The sleep state can include a low power state.
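
The sleep and wake behavior can be illustrated with a small Python model. It is a sketch of one of the described variants only, namely the one in which the circular buffer continues to cycle while the processing element sleeps and instructions are not executed. The class name, the `SLEEP` and `RECV` instruction names, and the use of `None` to represent the absence of valid data are all assumptions.

```python
class ProcessingElement:
    """Hypothetical element whose circular buffer keeps cycling
    while the element itself is asleep."""

    def __init__(self, program):
        self.program = list(program)
        self.pc = 0
        self.asleep = False
        self.executed = []

    def tick(self, data=None):
        """Advance one cycle; `data` is None when no valid data arrives."""
        if self.asleep and data is not None:
            self.asleep = False  # arrival of valid data wakes the element
        instr = self.program[self.pc]
        self.pc = (self.pc + 1) % len(self.program)  # buffer keeps cycling
        if self.asleep:
            return  # instruction is skipped while asleep
        if instr == "SLEEP":
            self.asleep = True
        else:
            self.executed.append((instr, data))


pe = ProcessingElement(["SLEEP", "RECV"])
pe.tick()          # executes SLEEP
pe.tick()          # asleep: RECV is skipped
pe.tick()          # asleep: SLEEP is skipped
pe.tick(data=42)   # valid data wakes the element, RECV executes
```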

The obtaining of data from a first processing element and the sending of the data to a second processing element can include a direct memory access (DMA). A DMA transfer can continue while valid data is available for the transfer. A DMA transfer can terminate when it has completed without error or when an error occurs during operation. Typically, a cluster that initiates a DMA transfer will request to be brought out of a sleep state when the transfer is completed. This waking is achieved by setting control signals that can control the one or more processing elements. Once the DMA transfer is initiated with a start instruction, a processing element in the cluster can execute a sleep instruction to place itself to sleep. When the DMA transfer terminates, the processing elements in the cluster can be brought out of sleep after the final instruction is executed. Note that if a control bit is set in the register of the cluster that is operating as a slave in the transfer, that cluster can also be brought out of the sleep state (if it is asleep during the transfer).

The cluster that is involved in a DMA and can be brought out of sleep after the DMA terminates can determine that it has been brought out of a sleep state based on the code that is executed. A cluster can be brought out of a sleep state based on the arrival of a reset signal and the execution of a reset instruction. The cluster can be brought out of sleep by the arrival of valid data (or control) following the execution of a switch instruction. A processing element can determine why it was brought out of a sleep state by the context of the code that the element starts to execute. A cluster can be awoken during a DMA operation by the arrival of valid data. The DMA instruction can be executed while the cluster remains asleep as the cluster awaits the arrival of valid data. Upon arrival of the valid data, the cluster is woken and the data stored. Accesses to one or more data random access memories (RAM) can be performed when the processing elements are operating. The accesses to the data RAMs can also be performed while the processing elements are in a low-power sleep state.

FIG. 4 shows fabric/external memory DMA transfer using an AXI interface. A read transfer can be included as a DMA operation initiated in a reconfigurable fabric. The reconfigurable fabric can include datapaths, first in first out (FIFO) memories, circular buffers, etc., and other elements, such as processing elements, control elements, interface elements, and so on. A memory request is initiated from a first cluster within a reconfigurable fabric. The memory request is queued within a first FIFO for execution by an external memory direct memory access (DMA) master. The memory request is issued to an external memory. The external memory is associated with the external memory DMA master. The external memory is accessed based on the memory request. Data is transferred between the external memory and a second cluster within the reconfigurable fabric, facilitated by a second FIFO. The diagram 400 can comprise a processor-implemented method for data manipulation. A reconfigurable fabric can support multiple interfaces, including interfaces that can support master clusters and interfaces that can support slave clusters. A master cluster can control the flow of data, instructions, etc. within the reconfigurable fabric. A slave cluster can act as a source, or a writer cluster, and as a sink, or a reader cluster. The interfaces can be distributed around the reconfigurable fabric including around the periphery of the reconfigurable fabric. The interfaces can be connected to blocks and can provide the interface to custom logic, an alignment between FIFO channels, and a static schedule of rows or columns of clusters.

A cluster can initiate a read from an external address. For example, a cluster 410 can initiate a write to a cluster 412 using data read from external memory. The cluster 410 requesting the read assembles a 128-bit queue manager (QM) request and sends it to the QM unit 430 via an uplink data channel 420. The request can be passed through other clusters in the fabric, such as the cluster 413 and the cluster 415. The QM 430 will send a response back to the initiating cluster 410 using paths among the clusters preconfigured in the reconfigurable fabric by an agent designed for a particular task in a particular set of clusters. In this example, the QM 430 has a response path available through Rx FIFO 426, cluster 417, cluster 419, cluster 421, and cluster 423. The QM 430 can also send the Tx FIFO 420 credit values along the same data path; that is, through Rx FIFO 426 and clusters 417, 419, 421, 423, 410, 413, and 415. A read for a first FIFO memory can be based on a clock for a first circular buffer. If the request is accepted, the QM sends the address and other information to the direct memory access (DMA) controller 450 so that the DMA controller 450 can initiate a read from, or write to, the cluster 412 from external memory (not shown). In the case of a read transfer, the data read from the external memory is written to the destination cluster 412. The data can be written once it is received at AMI 440 and passed through the DMA controller 450 and the Rx synchronization FIFO 422. The data is then passed through the cluster 425 to the destination cluster 412, where it is written for further/future use. The data can include non-empty data. The data can include valid data. When the data to be read becomes available at the advanced extensible interface (AXI) master load/store unit (AMI) 440, the DMA controller 450 will initiate the DMA write transfer to the destination cluster 412 and stream the data into the DMA path. The streaming of data into the DMA path can use Rx FIFO 422 for the DMA channel for managing the flow control. The DMA controller 450 can manage DMA transfers when the DMA transfers are rejected by the source cluster 410 and/or by the destination cluster 412. The DMA controller 450 contains a retry count register that will retry a transfer a number of times before signalling an error. Therefore, once the external memory executes a read operation based on the AXI master address/control path from AMI 440, the data will be returned at a later time (asynchronously to the fabric) to AMI 440 on the AXI master read data path. The read data is then passed to the destination cluster as described above.
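
The retry behavior of the DMA controller can be sketched in Python as a bounded loop. This is an assumption-laden illustration: the function name `dma_transfer`, the callable `attempt`, and the interpretation of the retry count as the number of additional attempts after the first rejection are hypothetical, since the disclosure does not specify the register semantics.

```python
def dma_transfer(attempt, retry_count):
    """Call `attempt()` until a cluster accepts the transfer.

    Returns the number of attempts used. Raises RuntimeError once the
    initial attempt and `retry_count` retries have all been rejected.
    """
    for tries in range(1, retry_count + 2):
        if attempt():
            return tries
    raise RuntimeError(
        "DMA transfer rejected after %d retries" % retry_count
    )


# A transfer that is rejected twice and then accepted.
responses = iter([False, False, True])
tries_used = dma_transfer(lambda: next(responses), retry_count=3)
```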

Returning now to diagram 400, a write transfer can be included as a DMA operation initiated in a reconfigurable fabric. The read and write transfers can use credit-based, lossless flow control. A plurality of switching elements includes a reconfigurable fabric. A first FIFO memory is coupled to the reconfigurable fabric by a first datapath from the first FIFO memory to a first switching element, from the plurality of switching elements. An interface receives data from an external memory and provides the data to the first FIFO memory by a second datapath. The first FIFO memory, the first circular buffer, the first datapath, and the second datapath, are used in direct memory access (DMA) of the external memory. The plurality of switching elements can be contained within a plurality of clusters within the reconfigurable fabric. The reconfigurable fabric can include other elements such as processing elements, control elements, interface elements, storage elements, and so on. The reconfigurable fabric can configure the various types of processing elements to form a machine. Multiple machines can be formed from the clusters within a single reconfigurable fabric. Each cluster within the plurality of clusters can be self-timed, and each cluster being self-timed can be self-clocked. The self-timing can be based on the logic family from which the reconfigurable fabric can be formed. The self-timed logic family can include null convention logic (NCL).

A cluster can send data in an uplink or transmit (Tx) direction through a channel that can be configured as a streaming data channel. A credit-based flow control technique can be used to regulate the flow of data being sent by the cluster. The credit-based flow control technique can include loss-less transfer of data. As in the read data case, the cluster 410 initiates a transfer. The instructions executing in the cluster 410 will initiate a DMA transfer for the cluster 412, which in this example is the sending cluster. The cluster 410 can send instructions and addresses for the transfer to external memory using a datapath through the cluster 413 and the cluster 415 to Tx FIFO 420. Tx FIFO 420 passes the information to QM 430, which shares it with the DMA controller 450 and an AXI slave (not shown). The AXI slave (not shown) then passes the information to the external memory per the AXI slave protocol once transmit data is available at the AXI slave from the DMA controller 450. The transmit data arrives at the DMA controller 450 from the sending cluster 412 by passing through cluster 419, cluster 429, cluster 431, and Tx FIFO 424. Again, note that paths among the clusters are preconfigured in the reconfigurable fabric by an agent designed for a particular task in a particular set of clusters. Note also that for this particular agent, the cluster 427 is not involved. However, the cluster 427 may be involved for other agents scheduled in one or more of the same clusters shown in example diagram 400.

The AXI slave (not shown) will send the transmit instructions and data to the external memory over the AXI interface. In due time, the external memory will respond with a completion indication, or status, that must be returned to the initiating cluster 410. The response path will be through the AXI slave interface to the DMA controller 450. The response path continues through Rx FIFO 422, cluster 425, cluster 412 (which is also the sending cluster in this example), cluster 419, cluster 421, cluster 423, and finally to cluster 410. A credit count is maintained by a credit accumulator credit-based flow control block 452, within the DMA controller 450, and a credit count credit-based flow control block 414, within the initiating cluster 410. The credit count can be equal to the number of slots in the Tx FIFO 420 that are known to be available. The credit count can be initialized to the size of the Tx FIFO 420. Data can only be sent to the uplink data path in FIFO 420 when the credit count is positive. When data is sent, the credit count can be decremented. As long as the credit count remains positive, the data can be sent at any time permitted by a static schedule for the uplink data path through Tx FIFO 420. When data is dequeued from the Tx FIFO 420 by the advanced extensible interface (AXI), either an AXI master 440 or an AXI slave (not shown), the credit for data removed from the queue can be sent back to the cluster 410 through QM 430, Rx FIFO 426, cluster 417, cluster 419, cluster 421, cluster 423, and finally to the initiating cluster 410. The credit for data removed value can fall between 1 and 4 and can correspond to the number of data records that were read from the Tx FIFO 420. The credit count credit-based flow control block 414 can determine when to inject credit values into the reverse path to the source cluster for each channel. 
The credit accumulator credit-based flow control block 452 can be included in the DMA controller 450 at the AXI interface block to keep track of the total accumulated credit between each reverse path opportunity. The credit accumulator within DMA controller 450 is reset to 0 whenever the accumulated credit is successfully read and sent to credit count credit-based flow control block 414. The accumulator value is sent back to the source cluster via the reverse path and the credit bit is set to ‘1’ to signify a valid credit value.
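
The interaction between the credit count in the initiating cluster and the credit accumulator in the DMA controller can be modeled with a short Python sketch. The class `CreditLink` and its method names are hypothetical, and both ends of the link are collapsed into one object for brevity. The model follows the rules stated above: the credit count starts at the size of the Tx FIFO, decrements on each send, and is replenished only when accumulated credit is returned over the reverse path, at which point the accumulator is reset to 0.

```python
from collections import deque


class CreditLink:
    """Hypothetical model of credit-based, lossless flow control."""

    def __init__(self, fifo_size):
        self.tx_fifo = deque()
        self.credit_count = fifo_size  # sender side (cf. block 414)
        self.accumulator = 0           # DMA controller side (cf. block 452)

    def send(self, record):
        """Send only while the credit count is positive."""
        if self.credit_count <= 0:
            return False
        self.credit_count -= 1
        self.tx_fifo.append(record)
        return True

    def dequeue(self, max_records=4):
        """Remove up to four records and accumulate their credit."""
        n = min(max_records, len(self.tx_fifo))
        taken = [self.tx_fifo.popleft() for _ in range(n)]
        self.accumulator += n
        return taken

    def return_credit(self):
        """Reverse-path opportunity: deliver credit, reset accumulator."""
        credit, self.accumulator = self.accumulator, 0
        self.credit_count += credit
        return credit


link = CreditLink(fifo_size=4)
sent = [link.send(i) for i in range(6)]  # last two sends are refused
taken = link.dequeue(3)
blocked = link.send(9)                   # credit not yet returned
returned = link.return_credit()
resumed = link.send(9)
```

Because a record is sent only when a free slot is known to exist, the Tx FIFO in this sketch can never overflow, which is the sense in which the scheme is lossless.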

Channels can be uplink streaming channels and can be downlink streaming channels. The uplink streaming channels and the downlink streaming channels can be used to transfer data from and to the reconfigurable fabric. The streaming of data in channels, whether uplink streaming channels or downlink streaming channels, can be controlled by status indicators. A status indicator can indicate whether an error or another condition has occurred. An error or condition can include a slave error, such as a lack of data. An error such as lack of data can be circumvented by ensuring that read requests are received only when there is valid data available to be read. Such a status can be indicated by reporting that valid data is available and the quantity of data that is available.

Flow control can also be accomplished for credit-based DMA read operations in the diagram 400. A channel can be configured as a DMA channel. A channel configured as a DMA channel can require a different flow control mechanism from regular data channels. The interfaces can act as master interfaces for DMA transfer through the reconfigurable fabric. For example, if a read request is made to the reconfigurable fabric to a channel configured as DMA, the read transfer is mastered by the DMA controller 450 and the DMA controls within the AXI interface. The master includes a credit count in the credit-based flow control block 452 that keeps track of the number of records in the Tx FIFO 420 that are known to be available. The credit count is initialized to the size of the Tx FIFO 420. When a data record is removed from the Tx FIFO 420, the credit count is increased. If the credit count is positive, and the DMA transfer is not complete, an empty data record can be inserted into the Rx FIFO 422. A memory bit can be set to indicate that the data record should be populated with data by the source cluster. If the credit count is 0 (meaning the Tx FIFO 420 is full), no records are entered into the Rx FIFO 422. The memory bit can be reset to 0 in order to prevent the controller in the source cluster from sending more data.

FIG. 5 shows writing via slave. The fabric can be accessed by external memory using the AXI slave functionality. A plurality of switching elements includes a reconfigurable fabric. A first FIFO memory is coupled to the reconfigurable fabric by a first datapath from the first FIFO memory to a first switching element, from the plurality of switching elements. An interface receives data from an external memory and provides the data to the first FIFO memory by a second datapath. The first FIFO memory, the first circular buffer, the first datapath, and the second datapath, are used in direct memory access (DMA) of the external memory. Transfers of data, instructions, etc., can include writing the data, instructions, and so on via the illustration 500. Transfers, including write transfers, can be initiated by a master cluster (not shown). The slave cluster, whether a source (writer slave) or a sink (reader slave) can be controlled by the master cluster. Write operations can be accomplished through an advanced extensible interface (AXI) that can be assigned as an AXI slave interface (AXIS) 510.

A master cluster can initiate a write into the reconfigurable fabric. The master cluster can request a transaction such as a write request by requesting a transaction to the AXI slave (AXIS) 510. Path 540 is a path for a write access to the reconfigurable fabric. The address of the access determines which AXI slave block (in this example AXIS 510) can receive the request. One of four FIFO interfaces can be selected via a 4-2 switch 520. The incoming data is steered through the 2-4 switch 522 along the path 540 to the addressed FIFO block 530, where the incoming data is loaded into the receiver (Rx) FIFO with the selected channel ID. The IRAM in the particular FIFO block can contain the appropriate entry points into the fabric for the addressed channel. A sufficient quantity of entry points can be required to match the bandwidth of the input data. A write clock for a first FIFO memory can be based on a clock for the external memory from which the data and/or instructions can be obtained for the writing.

Each slave interface such as AXIS 510 manages four FIFO interfaces. Each FIFO interface contains up to 15 data channels. A slave can manage write (and read) queues for up to 60 channels. Each channel can be programmed to be a DMA channel or a streaming data channel. DMA channels can be managed using a DMA protocol. Streaming data channels can maintain their own form of flow control using the status of the Rx FIFOs. The example AXIS architected signals ARS, RS, AWS, WS, and BS are shown as inputs and outputs to AXIS 510. The example AXIM architected signals AWM, WM, BM, ARM, and RM are shown as inputs and outputs to AXIM 512. Other types of external interfaces can have other architected signals. The example FIFO connections through FIFOs 530, 532, 534, and 536 to switch elements within the clusters of the reconfigurable fabric are shown as paths L2, which in some embodiments, indicate that a level 2 memory cache can be used to send data to and receive data from the fabric to the AXI interfaces.
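
The channel capacity of a slave interface follows from the figures above: four FIFO interfaces with up to 15 data channels each give up to 60 channels. A minimal Python sketch of this arithmetic is given below. The flat channel numbering, the function name `locate_channel`, and the `channel_mode` table are assumptions for illustration; the disclosure does not describe how channel IDs are encoded.

```python
FIFOS_PER_SLAVE = 4      # FIFO interfaces managed by one slave interface
CHANNELS_PER_FIFO = 15   # data channels per FIFO interface
MAX_CHANNELS = FIFOS_PER_SLAVE * CHANNELS_PER_FIFO


def locate_channel(channel):
    """Map an assumed flat channel number to (FIFO index, local channel)."""
    if not 0 <= channel < MAX_CHANNELS:
        raise ValueError("channel out of range")
    return divmod(channel, CHANNELS_PER_FIFO)


# Each channel is programmed as either a DMA or a streaming channel.
channel_mode = {ch: "streaming" for ch in range(MAX_CHANNELS)}
channel_mode[15] = "dma"
```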

FIG. 6 illustrates concurrent read/write via slave 600. The fabric can be accessed by external memory using the AXI slave functionality. A plurality of switching elements includes a reconfigurable fabric. A first FIFO memory is coupled to the reconfigurable fabric by a first datapath from the first FIFO memory to a first switching element, from the plurality of switching elements. An interface receives data from an external memory and provides the data to the first FIFO memory by a second datapath. The first FIFO memory, the first circular buffer, the first datapath, and the second datapath, are used in direct memory access (DMA) of the external memory. Transfers of data, instructions, etc., can include reading and writing the data, instructions, and so on. Transfers, whether read transfers or write transfers, can be initiated by a master cluster. The slave cluster, whether a source (writer slave) or a sink (reader slave) can be controlled by the master cluster. Concurrent reads and writes can be accomplished through an advanced extensible interface (AXI) that can be assigned as an AXI slave interface (AXIS) 610.

A master can request a read and/or write from/to the reconfigurable fabric. The master can request a transaction to the AXI slave (AXIS) 610. A write path 640 to write data into the fabric from external memory and a read path 642 to read data out of the fabric for consumption by the external memory can be established through the AXIS 610. The address of a write access request determines which AXI slave block (in this example, AXIS 610) receives the request and selects one of four interfaces via a 2-4 switch 622. The incoming data is steered through the 2-4 switch 622 via the path 642 to the addressed block where it is loaded into the receive (Rx) FIFO 632 with the selected channel ID. A write clock for a first FIFO memory can be based on a clock for an external memory. The address of a read access request determines which AXI slave block (in this example, AXIS 610) receives the request and selects one of four interfaces via a 4-2 switch 620. The outgoing data is steered via the path 640 through the 4-2 switch 620 from the addressed block through AXIS 610 with the selected channel ID. A read clock for a first FIFO memory can be based on a clock for a first circular buffer. An IRAM in the FIFO contains the appropriate entry points into the fabric for the addressed channel. A sufficient number of entry points is required to match the bandwidth of the input data. Concurrent read/write paths for both the AXI slave 610 blocks in the interface can be supported. The AXI slave 610 can process multiple outstanding read and write transactions, for example, along the write data path 640 and read data path 642. A write transaction is routed to FIFO 632, while the read operation is routed to FIFO 636.

Each slave interface (e.g. 610) manages four FIFO interfaces. Each FIFO interface contains up to 15 data channels. A slave can therefore manage read/write queues for up to 60 channels. Each channel can be programmed to be a DMA channel or a streaming data channel. DMA channels are managed using a DMA protocol. Streaming data channels can maintain their own form of flow control using the status of the receive (Rx) FIFOs. The flow control can be supported using a query technique. Read requests to slave interfaces can use a flow control technique. The example AXIS architected signals ARS, RS, AWS, WS, and BS are shown as inputs and outputs to AXIS 610. The example AXIM architected signals AWM, WM, BM, ARM, and RM are shown as inputs and outputs to AXIM 612. Other types of external interfaces can have other architected signals. The example FIFO connections through FIFOs 630, 632, 634, and 636 to switch elements within the clusters of the reconfigurable fabric are shown as paths L2, which in some embodiments, indicate that a level 2 memory cache can be used to send data to and receive data from the fabric to the AXI interfaces.
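By way of illustration only, the channel arithmetic above (four FIFO interfaces of up to 15 data channels each, giving up to 60 channels per slave) can be sketched as follows. The flat channel numbering and the function names are assumptions introduced for this sketch; the disclosure does not specify how channels are enumerated.

```python
# Illustrative sketch only. The two constants come from the description;
# the flat channel numbering is a hypothetical convention.
FIFO_INTERFACES_PER_SLAVE = 4
CHANNELS_PER_FIFO_INTERFACE = 15


def max_channels_per_slave():
    """Return the largest number of channels one slave interface can manage."""
    return FIFO_INTERFACES_PER_SLAVE * CHANNELS_PER_FIFO_INTERFACE


def locate_channel(flat_channel):
    """Map a hypothetical flat channel number to (FIFO interface, channel ID)."""
    if not 0 <= flat_channel < max_channels_per_slave():
        raise ValueError("channel number out of range for one slave interface")
    return divmod(flat_channel, CHANNELS_PER_FIFO_INTERFACE)
```

Under this assumed numbering, channel 59 would fall on the fourth FIFO interface (index 3) as channel ID 14.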

FIG. 7 shows master concurrent read/write transfers 700. The fabric can initiate a read or write operation with the external memory using the AXIM functionality. A plurality of switching elements includes a reconfigurable fabric. A first FIFO memory is coupled to the reconfigurable fabric by a first datapath from the first FIFO memory to a first switching element, from the plurality of switching elements. An interface receives data from an external memory and provides the data to the first FIFO memory by a second datapath. The external memory can be outside the reconfigurable fabric. The first FIFO memory, the first circular buffer, the first datapath, and the second datapath, are used in direct memory access (DMA) of the external memory. Transfers of data, instructions, etc., can include reading and writing the data, instructions, and so on. A master cluster wanting to read or write can request a transaction to an advanced extensible interface (AXI) slave. Each block can contain two AXI master interfaces, such as an AMI or AXIM (two master configurations not shown) or a master AXI interface and a slave AXI interface, as shown in the illustration 700.

An AXI master interface can be assigned as a master cluster interface such as AXIM 712. The AXIM 712 can be used for the master concurrent read/write transfers. Each AXI interface can be connected to 4 blocks through selectors such as the selectors 720 and 722 and can support 64 independently managed FIFO channels such as FIFO channels 730, 732, 734, and 736. A cluster can initiate an AXI transfer by sending a request to one of the AMI blocks via an uplink data channel such as the channel 742. The AMI block will send a response back to the cluster via the matching downlink channel. Both channels should be configured as streaming data, and the flow control in the uplink channel should be managed using the credit counter in the requesting cluster. The request includes a system address and a reconfigurable fabric address for the transfer. For a read operation 740, the data is transferred from the system address and a DMA transfer is established that writes the data to the cluster address in the destination cluster. For a write operation 742, a DMA transfer is set up to read the data from the cluster address in the source cluster and send it out to the system address. The FIFOs such as FIFOs 730 and 732 can be used to handle differing data transfer rates between sources and destinations in the reconfigurable fabric and a system. The example AXIS architected signals ARS, RS, AWS, WS, and BS are shown as inputs and outputs to AXIS 710. The example AXIM architected signals AWM, WM, BM, ARM, and RM are shown as inputs and outputs to AXIM 712. Other types of external interfaces can have other architected signals. The example FIFO connections through FIFOs 730, 732, 734, and 736 to switch elements within the clusters of the reconfigurable fabric are shown as paths L2, which in some embodiments, indicate that a level 2 memory cache can be used to send data to and receive data from the fabric to the AXI interfaces.
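By way of illustration only, the request handling described above can be sketched in software. The request carries a system address and a reconfigurable fabric (cluster) address, as stated in the description; the length field, the class and function names, and the use of Python lists to stand in for system memory and cluster memory are assumptions made for this sketch and do not describe the hardware.

```python
from dataclasses import dataclass


@dataclass
class TransferRequest:
    op: str              # "read" or "write", as seen from the cluster
    system_address: int  # address in the system (external) memory
    fabric_address: int  # address within the cluster
    length: int          # number of words; an assumed field


def service_request(req, system_memory, cluster_memory):
    """Model one transfer and return a simple response for the downlink."""
    sys_end = req.system_address + req.length
    fab_end = req.fabric_address + req.length
    if sys_end > len(system_memory) or fab_end > len(cluster_memory):
        raise IndexError("transfer exceeds memory bounds")
    if req.op == "read":
        # Read: data moves from the system address to the cluster address.
        cluster_memory[req.fabric_address:fab_end] = \
            system_memory[req.system_address:sys_end]
    elif req.op == "write":
        # Write: data moves from the cluster address to the system address.
        system_memory[req.system_address:sys_end] = \
            cluster_memory[req.fabric_address:fab_end]
    else:
        raise ValueError("unknown operation")
    return {"op": req.op, "words": req.length}
```

A cluster would place such a request on its uplink channel and receive the returned response on the matching downlink channel.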

FIG. 8A illustrates credit generation for a reconfigurable fabric accessing external memory. The reconfigurable fabric can include datapaths, first in first out (FIFO) memories, circular buffers, etc., and other elements, such as processing elements, control elements, interface elements, and so on. A memory request is initiated from a first cluster within a reconfigurable fabric. The memory request is queued within a first FIFO for execution by an external memory direct memory access (DMA) master. The memory request is issued to an external memory. The external memory is associated with the external memory DMA master. The external memory is accessed based on the memory request. Data is transferred between the external memory and a second cluster within the reconfigurable fabric, facilitated by a second FIFO. The illustration 800 includes FIFO 810, which can load and unload data as previously described. FIFO 810 can communicate to a credit accumulator 820 using a signal 812 that data has been unloaded. The signal 812 can be asserted every time data is unloaded, which can adjust the credit counter to signify that another slot in the FIFO is available. The credit accumulator 820 can report an accumulated credit value. The accumulated credit value can be fed back into the credit accumulator via a signal 822. FIFO 810 can have additional signals to help facilitate proper operation, especially during a fault condition. For example, FIFO 810 could output a signal indicating that synchronism has been lost with an external memory running asynchronously to the reconfigurable fabric. FIFO 810 could also output a signal indicating that it is full.

FIG. 8B is an example configuration for credit-based flow control. In the example 802, the credit accumulator 838 is shown as part of the overall credit-based flow control. A processing element 862 within a cluster 860 sends a DMA request to external memory, in this case HMC 840, over a path 870 through FIFO 834. FIFO 834 is controlled by a circular buffer 842, which can rotate through instructions to control the operation of FIFO 834. As data is unloaded from FIFO 834, an unload signal is sent to an accumulator 838. The unload signal can be buffered through a register 836 before signaling an adder 837 to increment the adder 837 of the accumulator 838. The output of the adder 837 can be latched for use in the next cycle by the register 836, which can then be fed back into the adder 837. A reset signal can be used to provide a reset to the accumulator 838. By way of example, whenever a record is read from a Tx FIFO (unloaded), the FIFO will report the number of valid data words in the record, and the credit count for that FIFO is increased by the number of words unloaded. The output of the accumulator 838 can be routed through a path 850 back to the processing element 862 to update the credit count kept there, thus providing a complete path for credit-based flow control.
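By way of illustration only, the credit loop described above can be modeled in software. In this sketch the requester spends one credit per word sent, the FIFO reports the number of words in each unloaded record to an accumulator, and the accumulated value is returned to the requester. The class names, the read-and-reset operation on the accumulator, and the choice of an initial credit count equal to the FIFO depth are assumptions made for the sketch; the description specifies only that credits increase by the number of words unloaded and that a reset signal exists.

```python
from collections import deque


class CreditAccumulator:
    """Accumulates words unloaded from the FIFO (adder plus register)."""

    def __init__(self):
        self.value = 0

    def add(self, words):
        self.value += words

    def read_and_reset(self):
        # Assumed behavior: report the accumulated value, then reset to zero.
        value, self.value = self.value, 0
        return value


class TxFifo:
    """FIFO of records; each record is a list of valid data words."""

    def __init__(self, depth, accumulator):
        self.depth = depth
        self.accumulator = accumulator
        self.records = deque()

    def words_held(self):
        return sum(len(record) for record in self.records)

    def load(self, record):
        if self.words_held() + len(record) > self.depth:
            raise OverflowError("FIFO full")
        self.records.append(list(record))

    def unload(self):
        record = self.records.popleft()
        self.accumulator.add(len(record))  # report the words unloaded
        return record


class Requester:
    """Processing element that keeps a local credit count."""

    def __init__(self, fifo, credits):
        self.fifo = fifo
        self.credits = credits

    def try_send(self, record):
        if len(record) > self.credits:
            return False  # not enough credit; hold the record
        self.fifo.load(record)
        self.credits -= len(record)
        return True

    def return_credits(self, count):
        self.credits += count
```

Because the requester never sends more words than it holds credits for, the FIFO in this model cannot overflow as long as the initial credit count does not exceed the FIFO depth.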

FIG. 9 illustrates interconnecting reconfigurable fabrics via external memory. Each reconfigurable fabric can include datapaths, first in first out (FIFO) memories, circular buffers, etc., and other elements, such as processing elements, control elements, interface elements, and so on. A memory request is initiated from a first cluster within a first reconfigurable fabric. The memory request is queued within a first FIFO for execution by an external memory direct memory access (DMA) master. The memory request is issued to an external memory. The external memory is associated with the external memory DMA master. The external memory is accessed based on the memory request. Data is transferred between the external memory and a second cluster within the first reconfigurable fabric, facilitated by a second FIFO. The illustration 900 shows a first reconfigurable fabric 0 910 connected to a separate and distinct second reconfigurable fabric 1 912 through HMC 920. Other memory structures can be used to interconnect the fabrics 910 and 912. HMC 920 can use FIFO 0-1 922 and FIFO 1-0 928 to buffer and synchronize data flowing between the reconfigurable fabric 0 910 and the reconfigurable fabric 1 912. Requests for data to be read or written from/to the HMC 920 can be initiated from either the reconfigurable fabric 0 910 or the reconfigurable fabric 1 912. Likewise, either fabric can be the source or the sink for the data. HMC 920 provides a path for the asynchronous reconfigurable fabrics to share data. In embodiments, a second reconfigurable fabric is connected to the reconfigurable fabric through the HMC.
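By way of illustration only, the two directional FIFOs between the fabrics can be modeled as a pair of queues. The class and method names are assumptions for this sketch, and the model ignores clocking, so it does not capture the synchronization role of the FIFOs between asynchronous fabrics.

```python
from collections import deque


class SharedMemoryLink:
    """Toy model of two fabrics exchanging data through directional FIFOs."""

    def __init__(self):
        # One queue per direction, mirroring FIFO 0-1 and FIFO 1-0.
        self.fifos = {(0, 1): deque(), (1, 0): deque()}

    def send(self, source, sink, word):
        self.fifos[(source, sink)].append(word)

    def receive(self, source, sink):
        """Return the oldest word sent from source to sink, or None if empty."""
        queue = self.fifos[(source, sink)]
        return queue.popleft() if queue else None
```

Either fabric can act as the source or the sink, and each direction preserves the order in which words were sent.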

FIG. 10 is a system diagram for data manipulation within a reconfigurable fabric. The reconfigurable fabric can include datapaths, first in first out (FIFO) memories, circular buffers, etc., and other elements, such as processing elements, control elements, interface elements, and so on. A memory request is initiated from a first cluster within a first reconfigurable fabric. The memory request is queued within a first FIFO for execution by an external memory direct memory access (DMA) master. The memory request is issued to an external memory. The external memory is associated with the external memory DMA master. The external memory is accessed based on the memory request. Data is transferred between the external memory and a second cluster within the first reconfigurable fabric, facilitated by a second FIFO.

The system 1000 can include one or more processors 1010 coupled to a memory 1012 which stores instructions. The system 1000 can include a display 1014 coupled to the one or more processors 1010 for displaying data, intermediate steps, instructions, and so on. The one or more processors 1010 are attached to the memory 1012, where the one or more processors, when executing the instructions which are stored, are configured to design a data manipulation architecture including: initiating a memory request from a first cluster within a reconfigurable fabric; queuing the memory request within a first FIFO for execution by an external memory direct memory access (DMA) master; issuing the memory request to an external memory, wherein the external memory is associated with the external memory DMA master; accessing the external memory based on the memory request; and transferring data between the external memory and a second cluster within the reconfigurable fabric.

Instructions to be loaded into circular buffers, data to be loaded into first in first out (FIFO) memories, data from external memories, etc. can be stored in a data store 1020. An initiating component 1030 can provide data, instructions, etc. to a first switching element to initiate a memory request from a first cluster within a reconfigurable fabric. A queuing component 1040 can provide data, instructions, etc. to queue the memory request within a first FIFO for execution by an external memory direct memory access (DMA) master. An issuing component 1050 can provide data, instructions, etc. to issue the memory request to an external memory, wherein the external memory is associated with the external memory DMA master. An accessing component 1060 can provide data, instructions, etc. to access the external memory based on the memory request. A transferring component 1070 can provide data, instructions, etc. to transfer data between the external memory and a second cluster within the reconfigurable fabric.
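By way of illustration only, the five operations (initiating, queuing, issuing, accessing, and transferring) can be traced for a single read in software. The function name, the dictionary used as a request, and the list standing in for external memory are assumptions for this sketch; it shows only the ordering of the operations, not the components 1030 through 1070 themselves.

```python
from collections import deque


def run_memory_read(external_memory, address, length):
    """Trace one read request through the five operations, in order."""
    request_fifo = deque()  # stands in for the first FIFO
    data_fifo = deque()     # stands in for the second FIFO

    # 1. Initiate: the first cluster forms a memory request.
    request = {"address": address, "length": length}

    # 2. Queue: the request waits in the first FIFO for the DMA master.
    request_fifo.append(request)

    # 3. Issue: the DMA master takes the oldest request and issues it.
    issued = request_fifo.popleft()

    # 4. Access: the external memory is accessed based on the request.
    start = issued["address"]
    words = external_memory[start:start + issued["length"]]

    # 5. Transfer: data moves to the second cluster through the second FIFO.
    data_fifo.extend(words)
    second_cluster = []
    while data_fifo:
        second_cluster.append(data_fifo.popleft())
    return second_cluster
```

For example, reading three words starting at address 2 from a memory holding the values 100 through 109 would deliver 102, 103, and 104 to the second cluster.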

In embodiments, a computer program product is embodied in a non-transitory computer readable medium for data manipulation comprising code which causes one or more processors to perform operations of: initiating a memory request from a first cluster within a reconfigurable fabric; queuing the memory request within a first FIFO for execution by an external memory direct memory access (DMA) master; issuing the memory request to an external memory, wherein the external memory is associated with the external memory DMA master; accessing the external memory based on the memory request; and transferring data between the external memory and a second cluster within the reconfigurable fabric.

Each of the above methods may be executed on one or more processors on one or more computer systems. Embodiments may include various forms of distributed computing, client/server computing, and cloud based computing. Further, it will be understood that the depicted steps or boxes contained in this disclosure's flow charts are solely illustrative and explanatory. The steps may be modified, omitted, repeated, or re-ordered without departing from the scope of this disclosure. Further, each step may contain one or more sub-steps. While the foregoing drawings and description set forth functional aspects of the disclosed systems, no particular implementation or arrangement of software and/or hardware should be inferred from these descriptions unless explicitly stated or otherwise clear from the context. All such arrangements of software and/or hardware are intended to fall within the scope of this disclosure.

The block diagrams and flowchart illustrations depict methods, apparatus, systems, and computer program products. The elements and combinations of elements in the block diagrams and flow diagrams show functions, steps, or groups of steps of the methods, apparatus, systems, computer program products and/or computer-implemented methods. Any and all such functions, generally referred to herein as a “circuit,” “module,” or “system,” may be implemented by computer program instructions, by special-purpose hardware-based computer systems, by combinations of special purpose hardware and computer instructions, by combinations of general purpose hardware and computer instructions, and so on.

A programmable apparatus which executes any of the above mentioned computer program products or computer-implemented methods may include one or more microprocessors, microcontrollers, embedded microcontrollers, programmable digital signal processors, programmable devices, programmable gate arrays, programmable array logic, memory devices, application specific integrated circuits, or the like. Each may be suitably employed or configured to process computer program instructions, execute computer logic, store computer data, and so on.

It will be understood that a computer may include a computer program product from a computer-readable storage medium and that this medium may be internal or external, removable and replaceable, or fixed. In addition, a computer may include a Basic Input/Output System (BIOS), firmware, an operating system, a database, or the like that may include, interface with, or support the software and hardware described herein.

Embodiments of the present invention are limited neither to conventional computer applications nor to the programmable apparatus that run them. To illustrate: the embodiments of the presently claimed invention could include an optical computer, quantum computer, analog computer, or the like. A computer program may be loaded onto a computer to produce a particular machine that may perform any and all of the depicted functions. This particular machine provides a technique for carrying out any and all of the depicted functions.

Any combination of one or more computer readable media may be utilized including but not limited to: a non-transitory computer readable medium for storage; an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor computer readable storage medium or any suitable combination of the foregoing; a portable computer diskette; a hard disk; a random access memory (RAM); a read-only memory (ROM); an erasable programmable read-only memory (EPROM, Flash, MRAM, FeRAM, or phase change memory); an optical fiber; a portable compact disc; an optical storage device; a magnetic storage device; or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

It will be appreciated that computer program instructions may include computer executable code. A variety of languages for expressing computer program instructions may include without limitation C, C++, Java, JavaScript™, ActionScript™, assembly language, Lisp, Perl, Tcl, Python, Ruby, hardware description languages, database programming languages, functional programming languages, imperative programming languages, and so on. In embodiments, computer program instructions may be stored, compiled, or interpreted to run on a computer, a programmable data processing apparatus, a heterogeneous combination of processors or processor architectures, and so on. Without limitation, embodiments of the present invention may take the form of web-based computer software, which includes client/server software, software-as-a-service, peer-to-peer software, or the like.

In embodiments, a computer may enable execution of computer program instructions including multiple programs or threads. The multiple programs or threads may be processed approximately simultaneously to enhance utilization of the processor and to facilitate substantially simultaneous functions. By way of implementation, any and all methods, program codes, program instructions, and the like described herein may be implemented in one or more threads which may in turn spawn other threads, which may themselves have priorities associated with them. In some embodiments, a computer may process these threads based on priority or other order.

Unless explicitly stated or otherwise clear from the context, the verbs “execute” and “process” may be used interchangeably to indicate execute, process, interpret, compile, assemble, link, load, or a combination of the foregoing. Therefore, embodiments that execute or process computer program instructions, computer-executable code, or the like may act upon the instructions or code in any and all of the ways described. Further, the method steps shown are intended to include any suitable method of causing one or more parties or entities to perform the steps. The parties performing a step, or portion of a step, need not be located within a particular geographic location or country boundary. For instance, if an entity located within the United States causes a method step, or portion thereof, to be performed outside of the United States then the method is considered to be performed in the United States by virtue of the causal entity.

While the invention has been disclosed in connection with preferred embodiments shown and described in detail, various modifications and improvements thereon will become apparent to those skilled in the art. Accordingly, the foregoing examples should not limit the spirit and scope of the present invention; rather it should be understood in the broadest sense allowable by law.

What is claimed is:
 1. A processor-implemented method for data manipulation comprising: initiating a memory request from a first cluster within a reconfigurable fabric; queuing the memory request within a first FIFO for execution by an external memory direct memory access (DMA) master; issuing the memory request to an external memory, wherein the external memory is associated with the external memory DMA master; accessing the external memory based on the memory request; and transferring data between the external memory and a second cluster within the reconfigurable fabric.
 2-3. (canceled)
 4. The method of claim 1 wherein the external memory is outside the reconfigurable fabric.
 5. The method of claim 1 wherein the transferring data is facilitated by a second FIFO.
 6. The method of claim 5 wherein the queuing is accomplished by a queue manager, the first FIFO, and the second FIFO.
 7. The method of claim 6 wherein the first FIFO is controlled by a first circular buffer.
 8. The method of claim 7 wherein the first circular buffer provides instructions for unloading data from the first FIFO to the reconfigurable fabric.
 9. The method of claim 7 wherein a read for the first FIFO is based on a clock for the first circular buffer.
 10. The method of claim 7 wherein a write clock for the first FIFO is based on a clock for the external memory.
 11. The method of claim 7 wherein the second FIFO is controlled by a second circular buffer.
 12. The method of claim 5 wherein the transferring data comprises a read operation from the external memory.
 13. The method of claim 5 wherein the transferring data comprises a write operation to the external memory.
 14. The method of claim 1 wherein the transferring data comprises a credit-based flow control.
 15. The method of claim 14 wherein the credit-based flow control is based on a credit counter.
 16. The method of claim 1 wherein the queuing prevents back-pressure on the first FIFO.
 17. The method of claim 1 wherein the first cluster within the reconfigurable fabric and the second cluster within the reconfigurable fabric are each controlled by one or more circular buffers.
 18. (canceled)
 19. The method of claim 17 wherein the first cluster within the reconfigurable fabric initiates a memory request based on contents of one of the one or more circular buffers.
 20. The method of claim 17 wherein the second cluster within the reconfigurable fabric is a receiving cluster for the transferring data.
 21. The method of claim 17 wherein the second cluster receives data from external memory outside the reconfigurable fabric.
 22. The method of claim 17 wherein the second cluster within the reconfigurable fabric is a sending cluster for the transferring data.
 23. The method of claim 22 wherein the second cluster sends data to external memory outside the reconfigurable fabric.
 24. The method of claim 1 wherein the first cluster within a reconfigurable fabric and the second cluster within the reconfigurable fabric are synchronized within the reconfigurable fabric.
 25. The method of claim 1 wherein the external memory comprises a hybrid memory cube (HMC).
 26. The method of claim 25 wherein a second reconfigurable fabric is connected to the reconfigurable fabric through the HMC.
 27. The method of claim 1 wherein each cluster within a reconfigurable fabric is self-timed.
 28. (canceled)
 29. The method of claim 1 wherein each cluster within a reconfigurable fabric is synchronized to be within a tic cycle of each other cluster within a reconfigurable fabric.
 30. The method of claim 1 wherein the data includes valid data.
 31. The method of claim 30 wherein the valid data wakes a processing element when the valid data arrives at the processing element.
 32. (canceled)
 33. A computer program product embodied in a non-transitory computer readable medium for data manipulation comprising code which causes one or more processors to perform operations of: initiating a memory request from a first cluster within a reconfigurable fabric; queuing the memory request within a first FIFO for execution by an external memory direct memory access (DMA) master; issuing the memory request to an external memory, wherein the external memory is associated with the external memory DMA master; accessing the external memory based on the memory request; and transferring data between the external memory and a second cluster within the reconfigurable fabric.
 34. A computer system for data manipulation comprising: a memory which stores instructions; one or more processors attached to the memory wherein the one or more processors, when executing the instructions which are stored, are configured to: initiate a memory request from a first cluster within a reconfigurable fabric; queue the memory request within a first FIFO for execution by an external memory direct memory access (DMA) master; issue the memory request to an external memory, wherein the external memory is associated with the external memory DMA master; access the external memory based on the memory request; and transfer data between the external memory and a second cluster within the reconfigurable fabric. 